fix: coerce judge score drift by schultzjack · Pull Request #756 · NVIDIA-NeMo/DataDesigner

schultzjack · 2026-06-17T00:30:59Z

Summary

normalize LLM-judge score values before enum validation in generated judge response models
accept numeric/string drift and simple case/whitespace drift when it maps unambiguously to a configured score option
keep unmatched or malformed scores on the existing Pydantic validation path

Scope

This addresses the LLM-judge validation path discussed in #569. It intentionally leaves the broader LLM-structured schema coercion path unchanged.

Testing

uv run --group dev pytest packages/data-designer-engine/tests/engine/column_generators/utils/test_judge_score_factory.py -q
uv run --group dev pytest packages/data-designer-engine/tests/engine/column_generators/utils/test_judge_score_factory.py packages/data-designer-engine/tests/engine/column_generators/utils/test_prompt_renderer.py packages/data-designer-engine/tests/engine/column_generators/generators/test_llm_completion_generators.py packages/data-designer-engine/tests/engine/models/recipes/test_response_recipes.py -q
make check-engine
make test-engine

Fixes #569

Signed-off-by: schultzjack <schultzjack@users.noreply.github.com>

github-actions · 2026-06-17T00:31:12Z

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

schultzjack · 2026-06-17T00:33:44Z

I have read the DCO document and I hereby sign the DCO.

schultzjack · 2026-06-17T00:34:03Z

recheck

greptile-apps · 2026-06-17T00:34:52Z

Greptile Summary

This PR fixes LLM-judge score drift by adding a mode="before" Pydantic model validator on BaseJudgeResponse that normalises incoming score values (numeric↔string, case, whitespace) before they reach enum validation, falling back to the standard Pydantic path when the coercion is ambiguous or the value is malformed.

_normalize_score_value converts values to a stripped, casefolded string (integer floats become their integer string), enabling reliable equality comparison across types.
_coerce_score_value performs an exact-match pass first (with a bool-guard to prevent True/False from silently matching 1/0), then falls back to normalised matching only when exactly one enum member maps to the same normalised form.
Four tests covering the primary drift scenarios, nested-model coercion, and unhashable-value fallthrough are added.
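
For readers without the diff open, the two-pass behaviour described above can be sketched roughly as follows. This is an illustrative reconstruction from the summary, not the code in the PR: the function names `normalize_score`, `same_kind_equal`, and `coerce_score`, and the example enums, are hypothetical.

```python
from enum import Enum
from typing import Any


def normalize_score(value: Any) -> str:
    """Reduce a score to a stripped, casefolded string; 1.0 becomes "1"."""
    if isinstance(value, float) and value.is_integer():
        value = int(value)
    return str(value).strip().casefold()


def same_kind_equal(value: Any, candidate: Any) -> bool:
    """Equality that never equates a bool with a non-bool (True == 1 in Python)."""
    if isinstance(value, bool) != isinstance(candidate, bool):
        return False
    return value == candidate


def coerce_score(value: Any, enum_type: type[Enum]) -> Any:
    # Pass 1: an exact (bool-guarded) match means there is nothing to fix.
    for member in enum_type:
        if same_kind_equal(value, member.value):
            return value
    # Pass 2: normalised matching, accepted only when it is unambiguous.
    wanted = normalize_score(value)
    matches = [m for m in enum_type if normalize_score(m.value) == wanted]
    if len(matches) == 1:
        return matches[0].value
    # Ambiguous or unrecognised: hand the original value to the validator.
    return value


class Quality(Enum):  # hypothetical string-valued options
    LOW = "1"
    HIGH = "2"


class Stars(Enum):  # hypothetical int-valued options
    ONE = 1
    TWO = 2


class Ambiguous(Enum):  # two members that collapse to the same normalised form
    LOWER = "yes"
    UPPER = "YES"


print(coerce_score(1, Quality))          # int drift into a string option
print(coerce_score("2", Stars))          # string drift into an int option
print(coerce_score(" Yes ", Ambiguous))  # ambiguous, returned unchanged
print(coerce_score(True, Stars))         # bool is not treated as 1
```

Returning the original value in the ambiguous case keeps the error reporting with the validator instead of guessing between members.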

Confidence Score: 5/5

The change is self-contained to the judge score coercion path and introduces no mutations to unrelated model behaviour.

The coercion logic is narrowly scoped: it only fires when the field annotation is a concrete Enum subclass, the input is a dict, and exactly one enum member matches after normalisation. The bool-guard in the exact-match phase correctly prevents True/False from silently collapsing onto integer members. Unrecognised or ambiguous values pass through to Pydantic unchanged, preserving existing validation behaviour. The new tests cover the three key drift categories plus the unhashable-value fallthrough path.
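
The gating conditions in that paragraph map onto a Pydantic v2 `mode="before"` validator roughly like this. It is a simplified sketch under stated assumptions: `Score` and `JudgeResponse` are hypothetical stand-ins for the generated models, the matching only compares stripped, casefolded string forms, and bools are skipped outright rather than guarded per comparison.

```python
from enum import Enum
from typing import Any

from pydantic import BaseModel, ValidationError, model_validator


class Score(Enum):  # hypothetical score options
    LOW = "1"
    HIGH = "2"


class JudgeResponse(BaseModel):  # hypothetical stand-in for a generated judge model
    score: Score
    reasoning: str = ""

    @model_validator(mode="before")
    @classmethod
    def coerce_score(cls, data: Any) -> Any:
        # Gate 1: only dict payloads that carry a "score" key are touched.
        if not isinstance(data, dict) or "score" not in data:
            return data
        # Gate 2: only a concrete Enum annotation is eligible for coercion.
        enum_type = cls.model_fields["score"].annotation
        if not (isinstance(enum_type, type) and issubclass(enum_type, Enum)):
            return data
        raw = data["score"]
        # Simplification in this sketch: bools are never coerced.
        if isinstance(raw, bool):
            return data
        wanted = str(raw).strip().casefold()
        matches = [m for m in enum_type if str(m.value).strip().casefold() == wanted]
        # Gate 3: rewrite the score only when exactly one member matches.
        if len(matches) == 1:
            return {**data, "score": matches[0].value}
        return data


print(JudgeResponse.model_validate({"score": 1}).score)
print(JudgeResponse.model_validate({"score": " 2 "}).score)

try:
    JudgeResponse.model_validate({"score": 99})
except ValidationError as exc:
    # Unmatched values reach normal enum validation and fail there.
    print(f"rejected with {exc.error_count()} error(s)")
```

Because the validator returns a new dict instead of mutating `data`, the caller's payload is left untouched.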

No files require special attention.

Important Files Changed

Filename	Overview
packages/data-designer-engine/src/data_designer/engine/column_generators/utils/judge_score_factory.py	Adds _normalize_score_value, _coerce_score_value, and a model_validator(mode='before') on BaseJudgeResponse to coerce LLM-returned score drift (numeric/string/case/whitespace) before Pydantic enum validation; unmatched values fall through to Pydantic unchanged.
packages/data-designer-engine/tests/engine/column_generators/utils/test_judge_score_factory.py	Adds four new tests covering int→string, string→int, and case/whitespace coercion via parametrize; nested structured-output coercion; and unhashable-score fallthrough to Pydantic ValidationError.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["LLM returns score value"] --> B["coerce_score model_validator mode=before"]
    B --> C{"data is dict with score?"}
    C -- No --> Z["Pass through unchanged"]
    C -- Yes --> D{"score field is Enum type?"}
    D -- No --> Z
    D -- Yes --> E["_coerce_score_value(value, enum_type)"]
    E --> F{"Exact match with bool guard?"}
    F -- Yes --> G["Return original value"]
    F -- No --> H["_normalize_score_value\nstrip + casefold, float-to-int"]
    H --> I{"Exactly 1 member matches normalized value?"}
    I -- Yes --> J["Return matched member.value"]
    I -- No --> K["Return original value\nambiguous or unrecognised"]
    G --> L["Pydantic validates against enum"]
    J --> L
    K --> L
    L --> M{"Valid enum value?"}
    M -- Yes --> N["Model instance stored value"]
    M -- No --> O["ValidationError raised"]

Reviews (1): Last reviewed commit: "fix: coerce judge score drift"

github-actions · 2026-06-24T10:04:29Z

Stale PR reminder

This PR has had failing checks for 7 days without activity.

Failing checks: check

Please push an update or leave a comment if you're still working on this.
Otherwise, this PR will be automatically closed in 7 days.

To prevent auto-close, add the keep-open label.

andreatgretel · 2026-06-24T13:25:05Z

Thanks for the contribution, this is a useful bit of tolerance around judge outputs. I reviewed the score coercion path and the generated Pydantic models. The implementation is nicely scoped and I don't see major blockers, but I'd like a small polish pass before merge.

A couple of test cases would make the new fallback behavior clearer:

Add a case for float drift into string score options, e.g. options {"1": "Low quality"} with model output 1.0 should coerce to "1".
Add a case for an out-of-range scalar like 99 to confirm it still falls through to Pydantic validation rather than being coerced.
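
As a rough shape for those two cases, something like the following would do. It is a standalone sketch: `Quality`, `to_key`, and `coerce` are hypothetical stand-ins for the real factory and helpers, and a plain `Enum` lookup stands in for Pydantic's enum validation.

```python
from enum import Enum
from typing import Any


class Quality(Enum):  # hypothetical options, e.g. {"1": "Low quality", "2": "High quality"}
    LOW = "1"
    HIGH = "2"


def to_key(value: Any) -> str:
    # Integer-valued floats compare as their integer string: 1.0 -> "1".
    if isinstance(value, float) and value.is_integer():
        value = int(value)
    return str(value).strip().casefold()


def coerce(value: Any) -> Any:
    # Stand-in for the coercion helper: rewrite only on a unique match.
    matches = [m.value for m in Quality if to_key(m.value) == to_key(value)]
    return matches[0] if len(matches) == 1 else value


def test_float_drift_maps_to_string_option() -> None:
    assert coerce(1.0) == "1"


def test_out_of_range_scalar_is_left_for_validation() -> None:
    # The helper must not invent a match for 99 ...
    assert coerce(99) == 99
    # ... so the downstream enum check still rejects it.
    try:
        Quality(coerce(99))
    except ValueError:
        pass
    else:
        raise AssertionError("expected the out-of-range score to be rejected")


test_float_drift_maps_to_string_option()
test_out_of_range_scalar_is_left_for_validation()
```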

Also, please add a short comment above the bool guard in _coerce_score_value(). It's there because bool is a subclass of int in Python, so True == 1 and False == 0; that context will help keep the guard from looking accidental.
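
A quick illustration of why the guard matters, using only built-in behaviour. The helper names `naive_match` and `guarded_match` are made up for this example and are not taken from the PR.

```python
from typing import Any

# bool is a subclass of int, so booleans compare equal to 0 and 1.
assert issubclass(bool, int)
assert True == 1 and False == 0


def naive_match(value: Any, options: list[Any]) -> bool:
    # Plain membership uses ==, so True silently matches the option 1.
    return value in options


def guarded_match(value: Any, options: list[Any]) -> bool:
    # Require both sides to agree on "is this a bool?" before comparing.
    return any(
        isinstance(value, bool) == isinstance(option, bool) and value == option
        for option in options
    )


options = [0, 1, 2]
print(naive_match(True, options))    # True: the accidental match
print(guarded_match(True, options))  # False: the guard blocks it
print(guarded_match(1, options))     # True: real ints still match
```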

Focused tests and smoke checks passed locally. Once those small coverage/readability items are in, this looks good to merge from my side.

fix: coerce judge score drift

2b3c629

Signed-off-by: schultzjack <schultzjack@users.noreply.github.com>

schultzjack requested a review from a team as a code owner June 17, 2026 00:31

github-actions Bot mentioned this pull request Jun 22, 2026

Agentic CI: Issue & PR Triage Tracker #562

Open

schultzjack wants to merge 1 commit into NVIDIA-NeMo:main from schultzjack:codex/569-coerce-judge-scores


2 participants
